EuroGOV: Engineering a Multilingual Web Corpus

Authors

  • Börkur Sigurbjörnsson
  • Jaap Kamps
  • Maarten de Rijke
Abstract

EuroGOV is a multilingual web corpus that was created to serve as the document collection for WebCLEF, the CLEF 2005 web retrieval task. EuroGOV is a collection of web pages crawled from the European Union portal, European Union member state governmental web sites, and Russian government web sites. The corpus contains over 3 million documents written in more than 20 different European languages. In this paper we provide a detailed description of the EuroGOV collection.

Similar resources

Web Retrieval Experiments with the EuroGOV Corpus at the University of Hildesheim

In the CLEF 2005 initiative, multilingual web retrieval was integrated as a task for the first time. This paper describes experiments based on one multilingual index carried out at the University of Hildesheim. Several indexing strategies based on a multilingual index have been tested with the EuroGOV corpus. Boosting topic fields with higher weight led to best results during post submission ru...

Discovering Parallel Text from the World Wide Web

A parallel corpus is a rich linguistic resource for various multilingual text management tasks, including crosslingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...

Lexical Database for Multiple Languages: Multilingual Word Semantic Network

Data mining and knowledge engineering have become tough tasks due to the large amount of data available on the web nowadays. The validity and reliability of data have also become a major point of debate in knowledge acquisition. Besides, acquiring knowledge from different languages has become another concern. There are many language translators and corpora developed but the function of these translators an...

Demo of iMAG Possibilities: MT-postediting, Translation Quality Evaluation, Parallel Corpus Production

An interactive Multilingual Access Gateway (iMAG) dedicated to a web site S (iMAG-S) is a good tool to make S accessible in many languages immediately and without editorial responsibility. Visitors of S as well as paid or unpaid post-editors and moderators contribute to the continuous and incremental improvement of the most important textual segments, and eventually of all. Pre-translations are...

A Multilingual Information Retrieval Tool Hierarchy for a WWW "Virtual Corpus"

The article addresses: 1. the design of an information retrieval (IR) toolkit, named as the Multilingual Information Retrieval Tool Hierarchy (MIRTH) search engine, which works with virtual corpora on the World Wide Web, also known as the Web or WWW for short. It is motivated by the desire to create a multilingual search engine to retrieve information by accessing a virtual corpus; 2. the imple...

Publication date: 2005